Local Winter Weather Analysis

As a skier, who grew up in the Western, NY/Southern Tier area, and stayed here, I have personally noticed the shortening and worsening (for those of use who like winter) of the winter season in my 33 years, especially now that I am coaching and need to pay particular attention to the conditions from day-to-day, or even hour-to-hour, to ensure not only a useful environment for training, but a safe one. Worse, only is the season getting shorter on both ends, not starting as early in December, or even in January anymore, but ending early in March instead of later into the month. With this in mind I am interested in seeing whether what I have been experiencing is purely anecdotal, or if I am having memory issues, or if there is any truth to this.

The Data In Question

NOAA, the National Oceanic and Atmospheric Administration has tools available on their website to download public weather station data from around the USA. By using the "Search Tool" I found data from 3 different weather stations that are local and significant to me in Western NY and the Souther Tier: Bath, NY; Dansville, NY; Alfred, NY. Each station has a certain "Data Coverage" percentage for each particular data field that are available. There is, unfortunately, nothing that we can do to remedy this but we can keep it in mind and and put any missing data into context. Their footnote regarding coverage states "Coverage is an approximation of total completeness based on the most complete data element, and the overall data range."

NOAA provides methods to pull data via APIs but right now I am just going to download CSVs of the data and use it here.

In each of NOAA's stations' "Daily Summaries Station Details" they list a data documentation document which also lists a supplementary document.

Weather Stations

Alfred, NY

Alfred has a single station with data available from 1893/01/01 to 2024/08/25

- ALFRED, NY US - GHCND:USC00300085 - 1893/01/01 to 2024/08/25 - 91% Coverage

Pertinent data available from Alfred, NY

Air Temperature

  • TMAX - Maximum Temperature - 90% Coverage
  • TMIN - Minimum Temperature - 91% Coverage
  • TOBS - Temperature at time of observation - 87% Coverage

  • Precipitation

  • PRCP - Precipitation - 87% Coverage
  • SNOW - Snowfall - 80% Coverage
  • SNWD - Snow Depth - 75% Coverage
  • Bath, NY

    The Bath, NY data is comprised of data in 6 stations in different locations since 1953.

  • BATH, NY US - GHCND:USC00300448 - 1953/08/01 to 2014/09/30 - 91% Coverage
  • AVOCA 5.0 SSW, NY US GHCND:US1NYST0023 - 2008/01/11 to 2011/09/29 - 91% Coverage
  • BATH 4.2 E, NY US GHCND:US1NYST0033 - 2010/05/14 to 2024/08/25 - 47% Coverage
  • BATH 0.4 N, NY US GHCND:US1NYST0029 - 2010/09/01 to 2012/07/12 - 36% Coverage
  • BATH 6.8 SW, NY US GHCND:US1NYST0035 - 2016/04/02 to 2022/03/31 - 36% Coverage
  • BATH 1.3 E, NY US GHCND:US1NYST0044 - 2020/05/01 to 2024/08/26 - 33% Coverage
  • Pertinent data available from Bath, NY

    Air Temperature

  • TMAX - Maximum Temperature - 88% Coverage
  • TMIN - Minimum Temperature - 88% Coverage
  • TOBS - Temperature at time of observation - 89% Coverage

  • Precipitation

  • PRCP - Precipitation - 91% Coverage
  • SNOW - Snowfall - 90% Coverage
  • SNWD - Snow Depth - 100% Coverage
  • Dansville, NY

    Dansville has a single station with data available from 1917/07/01 to 2024/08/25

    - DANSVILLE, NY US - GHCND:USC00301974 - 1917/07/01 to 2024/08/25 - 95% Coverage

    Pertinent data available from Dansville, NY

    Air Temperature

  • TMAX - Maximum Temperature - 95% Coverage
  • TMIN - Minimum Temperature - 95% Coverage
  • TOBS - Temperature at time of observation - 94% Coverage

  • Precipitation

  • PRCP - Precipitation - 91% Coverage
  • SNOW - Snowfall - 84% Coverage
  • SNWD - Snow Depth - 69% Coverage
  • >>>>>>

    What is our research?

    After that investigation of the NOAA website for each weather stations' available data we can garner some insight over what we might be able to investigate, all in the context of identifying a pattern showing the change in winter weather patterns over time.

    I think we can investigate the following topics:

  • Number of days with a high temperature above freezing (32F, 0C) during winter months
  • Number of days with snowfall during winter months
  • Total snowfall during winter months
  • Average snow pack (snow depth) during winter months

  • I wanted to find the number of days with rain because that highly affects both the snowpack and daily experience during winter, but the PRCP Precipitation data is a little more verbose than I can understand at the moment and will need to take more time to figure it out before I can confidently look into it.

    Let's get our data

    But first let's check to see that we have all the files we need in our directory and ensure we have their file name, then we can load the data.

    To ensure that we can verify what data we are working with at any step, let's also add some identifying data to the dataframes. We will do this by adding columns, one with the name of the file itself, and one with a broad location identifier.

    In [39]:
    import os
    
    #List files in directory
    print('The files in our directory are:')
    for file_path in os.scandir('Data/'):
        if file_path.is_file():
            print(file_path.name)
    
    The files in our directory are:
    NOAA Daily Summaries - Alfred, NY.csv
    NOAA Daily Summaries - Bath, NY.csv
    NOAA Daily Summaries - Dansville, NY.csv
    

    Okay, we have the files we downloaded

    Now we can get to importing the data and using it.

    In [41]:
    #import the data
    import pandas as pd
    pd.options.mode.chained_assignment = None
    
    #Specify our file names
    csv_alfred = 'Data/NOAA Daily Summaries - Alfred, NY.csv'
    csv_bath = 'Data/NOAA Daily Summaries - Bath, NY.csv'
    csv_dansville = 'Data/NOAA Daily Summaries - Dansville, NY.csv'
    
    #import Alfred data, and add filename column
    df_alfred = pd.read_csv(csv_alfred, low_memory=False) #Read file into dataframe
    df_alfred.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a row with the file name
    df_alfred.insert(1,'LOCATION','Alfred, NY') #Insert a row with broad Location name
    
    #import Bath data
    df_bath = pd.read_csv(csv_bath, low_memory=False) #Read file into dataframe
    df_bath.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a row with the file name
    df_bath.insert(1,'LOCATION','Bath, NY') #Insert a row with broad Location name
    
    #import Dansville data
    df_dansville = pd.read_csv(csv_dansville, low_memory=False) #Read file into dataframe
    df_dansville.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a row with the file name
    df_dansville.insert(1,'LOCATION','Dansville, NY') #Insert a row with broad Location name
    
    In [42]:
    #Verify the data fields in each file
    print('File Name:',csv_alfred)
    print('Location:',df_alfred['LOCATION'].iloc[0])
    df_alfred.info()
    
    File Name: Data/NOAA Daily Summaries - Alfred, NY.csv
    Location: Alfred, NY
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 43858 entries, 0 to 43857
    Data columns (total 52 columns):
     #   Column           Non-Null Count  Dtype  
    ---  ------           --------------  -----  
     0   FILE_NAME        43858 non-null  object 
     1   LOCATION         43858 non-null  object 
     2   STATION          43858 non-null  object 
     3   NAME             43858 non-null  object 
     4   LATITUDE         43858 non-null  float64
     5   LONGITUDE        43858 non-null  float64
     6   ELEVATION        43858 non-null  float64
     7   DATE             43858 non-null  object 
     8   DAPR             6 non-null      float64
     9   DAPR_ATTRIBUTES  6 non-null      object 
     10  DASF             1 non-null      float64
     11  DASF_ATTRIBUTES  1 non-null      object 
     12  MDPR             6 non-null      float64
     13  MDPR_ATTRIBUTES  6 non-null      object 
     14  MDSF             1 non-null      float64
     15  MDSF_ATTRIBUTES  1 non-null      object 
     16  PRCP             42059 non-null  float64
     17  PRCP_ATTRIBUTES  42059 non-null  object 
     18  SNOW             38702 non-null  float64
     19  SNOW_ATTRIBUTES  38702 non-null  object 
     20  SNWD             33925 non-null  float64
     21  SNWD_ATTRIBUTES  33925 non-null  object 
     22  TMAX             43496 non-null  float64
     23  TMAX_ATTRIBUTES  43496 non-null  object 
     24  TMIN             43532 non-null  float64
     25  TMIN_ATTRIBUTES  43532 non-null  object 
     26  TOBS             39470 non-null  float64
     27  TOBS_ATTRIBUTES  39470 non-null  object 
     28  WESD             68 non-null     float64
     29  WESD_ATTRIBUTES  68 non-null     object 
     30  WT01             253 non-null    float64
     31  WT01_ATTRIBUTES  253 non-null    object 
     32  WT03             694 non-null    float64
     33  WT03_ATTRIBUTES  694 non-null    object 
     34  WT04             254 non-null    float64
     35  WT04_ATTRIBUTES  254 non-null    object 
     36  WT05             29 non-null     float64
     37  WT05_ATTRIBUTES  29 non-null     object 
     38  WT06             37 non-null     float64
     39  WT06_ATTRIBUTES  37 non-null     object 
     40  WT07             1 non-null      float64
     41  WT07_ATTRIBUTES  1 non-null      object 
     42  WT09             2 non-null      float64
     43  WT09_ATTRIBUTES  2 non-null      object 
     44  WT11             88 non-null     float64
     45  WT11_ATTRIBUTES  88 non-null     object 
     46  WT14             3 non-null      float64
     47  WT14_ATTRIBUTES  3 non-null      object 
     48  WT16             1 non-null      float64
     49  WT16_ATTRIBUTES  1 non-null      object 
     50  WT18             4 non-null      float64
     51  WT18_ATTRIBUTES  4 non-null      object 
    dtypes: float64(25), object(27)
    memory usage: 17.4+ MB
    
    In [43]:
    #Verify the data fields in each file
    print('File Name:',csv_bath)
    print('Location:',df_bath['LOCATION'].iloc[0])
    df_bath.info()
    
    File Name: Data/NOAA Daily Summaries - Bath, NY.csv
    Location: Bath, NY
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 24741 entries, 0 to 24740
    Data columns (total 44 columns):
     #   Column           Non-Null Count  Dtype  
    ---  ------           --------------  -----  
     0   FILE_NAME        24741 non-null  object 
     1   LOCATION         24741 non-null  object 
     2   STATION          24741 non-null  object 
     3   NAME             24741 non-null  object 
     4   LATITUDE         24741 non-null  float64
     5   LONGITUDE        24741 non-null  float64
     6   ELEVATION        24741 non-null  float64
     7   DATE             24741 non-null  object 
     8   DAPR             247 non-null    float64
     9   DAPR_ATTRIBUTES  247 non-null    object 
     10  DASF             3 non-null      float64
     11  DASF_ATTRIBUTES  3 non-null      object 
     12  MDPR             248 non-null    float64
     13  MDPR_ATTRIBUTES  248 non-null    object 
     14  MDSF             4 non-null      float64
     15  MDSF_ATTRIBUTES  4 non-null      object 
     16  PRCP             24412 non-null  float64
     17  PRCP_ATTRIBUTES  24412 non-null  object 
     18  SNOW             23025 non-null  float64
     19  SNOW_ATTRIBUTES  23025 non-null  object 
     20  SNWD             20704 non-null  float64
     21  SNWD_ATTRIBUTES  20704 non-null  object 
     22  TMAX             17315 non-null  float64
     23  TMAX_ATTRIBUTES  17315 non-null  object 
     24  TMIN             17363 non-null  float64
     25  TMIN_ATTRIBUTES  17363 non-null  object 
     26  TOBS             17383 non-null  float64
     27  TOBS_ATTRIBUTES  17383 non-null  object 
     28  WESD             16 non-null     float64
     29  WESD_ATTRIBUTES  16 non-null     object 
     30  WESF             25 non-null     float64
     31  WESF_ATTRIBUTES  25 non-null     object 
     32  WT01             287 non-null    float64
     33  WT01_ATTRIBUTES  287 non-null    object 
     34  WT03             146 non-null    float64
     35  WT03_ATTRIBUTES  146 non-null    object 
     36  WT04             24 non-null     float64
     37  WT04_ATTRIBUTES  24 non-null     object 
     38  WT05             8 non-null      float64
     39  WT05_ATTRIBUTES  8 non-null      object 
     40  WT06             9 non-null      float64
     41  WT06_ATTRIBUTES  9 non-null      object 
     42  WT11             15 non-null     float64
     43  WT11_ATTRIBUTES  15 non-null     object 
    dtypes: float64(21), object(23)
    memory usage: 8.3+ MB
    
    In [44]:
    #Verify the data fields in each file
    print('File Name:',csv_dansville)
    print('Location:',df_dansville['LOCATION'].iloc[0])
    df_dansville.info()
    
    File Name: Data/NOAA Daily Summaries - Dansville, NY.csv
    Location: Dansville, NY
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 37760 entries, 0 to 37759
    Data columns (total 46 columns):
     #   Column           Non-Null Count  Dtype  
    ---  ------           --------------  -----  
     0   FILE_NAME        37760 non-null  object 
     1   LOCATION         37760 non-null  object 
     2   STATION          37760 non-null  object 
     3   NAME             37760 non-null  object 
     4   LATITUDE         37760 non-null  float64
     5   LONGITUDE        37760 non-null  float64
     6   ELEVATION        37760 non-null  float64
     7   DATE             37760 non-null  object 
     8   DAPR             8 non-null      float64
     9   DAPR_ATTRIBUTES  8 non-null      object 
     10  DASF             1 non-null      float64
     11  DASF_ATTRIBUTES  1 non-null      object 
     12  MDPR             9 non-null      float64
     13  MDPR_ATTRIBUTES  9 non-null      object 
     14  MDSF             1 non-null      float64
     15  MDSF_ATTRIBUTES  1 non-null      object 
     16  PRCP             37319 non-null  float64
     17  PRCP_ATTRIBUTES  37319 non-null  object 
     18  SNOW             32589 non-null  float64
     19  SNOW_ATTRIBUTES  32589 non-null  object 
     20  SNWD             26951 non-null  float64
     21  SNWD_ATTRIBUTES  26951 non-null  object 
     22  TMAX             36842 non-null  float64
     23  TMAX_ATTRIBUTES  36842 non-null  object 
     24  TMIN             36884 non-null  float64
     25  TMIN_ATTRIBUTES  36884 non-null  object 
     26  TOBS             36777 non-null  float64
     27  TOBS_ATTRIBUTES  36777 non-null  object 
     28  WT01             139 non-null    float64
     29  WT01_ATTRIBUTES  139 non-null    object 
     30  WT03             300 non-null    float64
     31  WT03_ATTRIBUTES  300 non-null    object 
     32  WT04             37 non-null     float64
     33  WT04_ATTRIBUTES  37 non-null     object 
     34  WT05             5 non-null      float64
     35  WT05_ATTRIBUTES  5 non-null      object 
     36  WT06             5 non-null      float64
     37  WT06_ATTRIBUTES  5 non-null      object 
     38  WT08             2 non-null      float64
     39  WT08_ATTRIBUTES  2 non-null      object 
     40  WT09             16 non-null     float64
     41  WT09_ATTRIBUTES  16 non-null     object 
     42  WT11             21 non-null     float64
     43  WT11_ATTRIBUTES  21 non-null     object 
     44  WT14             98 non-null     float64
     45  WT14_ATTRIBUTES  98 non-null     object 
    dtypes: float64(22), object(24)
    memory usage: 13.3+ MB
    

    Let's see the data we are working with

    We know with our investigation of the website that each of the 3 files share column names with similar data, so we can investigate the file with the highest number of columns, which would be the file for Alfred, NY.

    As we can see we only have roughly 45000 rows for the largest file, Alfred, so with scrolling cells here we can look at the data and verify what we might need to use

    We can quickly see that only a small handful of the columns have non-null (rows with data) data, so we can most likely ignore those columns, either only selecting the other columns later, removing them from the dataframe, or re-creating the dataframe with only the useful columns included. We can also see that many of the columns with useful data have attribute columns - those are codified data in a multivalue structure to help understand the data better.

    With so few rows we can also look at the data in Excel, Google Sheets, LibreOffice Calc, or Notepad(++). The notepad options are more difficult to understand because it will not be displaying cells.

    In [46]:
    #Display the dataframe as a table
    from IPython.display import display
    pd.set_option('display.max_columns', None) #Display all of the columns
    print('Location:',df_alfred['LOCATION'].iloc[0]) #Show the file location above the table to verify our file
    display(df_alfred) #Display the data
    
    Location: Alfred, NY
    
    FILE_NAME LOCATION STATION NAME LATITUDE LONGITUDE ELEVATION DATE DAPR DAPR_ATTRIBUTES DASF DASF_ATTRIBUTES MDPR MDPR_ATTRIBUTES MDSF MDSF_ATTRIBUTES PRCP PRCP_ATTRIBUTES SNOW SNOW_ATTRIBUTES SNWD SNWD_ATTRIBUTES TMAX TMAX_ATTRIBUTES TMIN TMIN_ATTRIBUTES TOBS TOBS_ATTRIBUTES WESD WESD_ATTRIBUTES WT01 WT01_ATTRIBUTES WT03 WT03_ATTRIBUTES WT04 WT04_ATTRIBUTES WT05 WT05_ATTRIBUTES WT06 WT06_ATTRIBUTES WT07 WT07_ATTRIBUTES WT09 WT09_ATTRIBUTES WT11 WT11_ATTRIBUTES WT14 WT14_ATTRIBUTES WT16 WT16_ATTRIBUTES WT18 WT18_ATTRIBUTES
    0 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 1893-01-01 NaN NaN NaN NaN NaN NaN NaN NaN 0.00 P,,6, 0.0 ,,6, NaN NaN 39.0 ,,6 24.0 ,,6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    1 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 1893-01-02 NaN NaN NaN NaN NaN NaN NaN NaN 1.65 ,,6, 0.0 ,,6, NaN NaN 34.0 ,,6 20.0 ,,6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 1893-01-03 NaN NaN NaN NaN NaN NaN NaN NaN 0.30 ,,6, 3.0 ,,6, NaN NaN 20.0 ,,6 4.0 ,,6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    3 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 1893-01-04 NaN NaN NaN NaN NaN NaN NaN NaN 0.20 ,,6, 2.0 ,,6, NaN NaN 10.0 ,,6 3.0 ,,6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    4 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 1893-01-05 NaN NaN NaN NaN NaN NaN NaN NaN 0.30 ,,6, 3.0 ,,6, NaN NaN 15.0 ,,6 0.0 ,,6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    43853 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 2024-08-20 NaN NaN NaN NaN NaN NaN NaN NaN 0.08 ,,H,0700 0.0 ,,H,0700 0.0 ,,H,0700 64.0 ,,H 48.0 ,,H 51.0 ,,H,0700 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    43854 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 2024-08-21 NaN NaN NaN NaN NaN NaN NaN NaN 0.00 ,,H,0700 0.0 ,,H,0700 0.0 ,,H,0700 61.0 ,,H 47.0 ,,H 48.0 ,,H,0700 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    43855 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 2024-08-22 NaN NaN NaN NaN NaN NaN NaN NaN 0.02 ,,H,0700 0.0 ,,H,0700 0.0 ,,H,0700 64.0 ,,H 47.0 ,,H 51.0 ,,H,0700 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    43856 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 2024-08-23 NaN NaN NaN NaN NaN NaN NaN NaN 0.00 ,,H,0700 0.0 ,,H,0700 0.0 ,,H,0700 73.0 ,,H 51.0 ,,H 52.0 ,,H,0700 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    43857 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 42.27985 -77.76643 586.4 2024-08-24 NaN NaN NaN NaN NaN NaN NaN NaN 0.00 ,,H,0700 0.0 ,,H,0700 NaN NaN 78.0 ,,H 52.0 ,,H 52.0 ,,H,0700 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

    43858 rows × 52 columns

    Let's reload the dataframe with only the useful columns

    In [48]:
    #import Alfred data, and add filename column
    df_alfred = pd.read_csv(csv_alfred, usecols = ['STATION','NAME','DATE','TMAX','TMIN','TOBS','PRCP','SNOW','SNWD'], low_memory=False) #Read file into dataframe
    df_alfred['DATE']= pd.to_datetime(df_alfred['DATE']) #Convert the DATE column from an object in the dataframe to a Date format
    df_alfred.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a row with the file name
    df_alfred.insert(1,'LOCATION','Alfred, NY') #Insert a row with broad Location name
    
    #import Bath data
    df_bath = pd.read_csv(csv_bath, usecols = ['STATION','NAME','DATE','TMAX','TMIN','TOBS','PRCP','SNOW','SNWD'], low_memory=False) #Read file into dataframe
    df_bath['DATE']= pd.to_datetime(df_bath['DATE']) #Convert the DATE column from an object in the dataframe to a Date format
    df_bath.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a row with the file name
    df_bath.insert(1,'LOCATION','Bath, NY') #Insert a row with broad Location name
    
    #import Dansville data
    df_dansville = pd.read_csv(csv_dansville, usecols = ['STATION','NAME','DATE','TMAX','TMIN','TOBS','PRCP','SNOW','SNWD'], low_memory=False) #Read file into dataframe
    df_dansville['DATE']= pd.to_datetime(df_dansville['DATE']) #Convert the DATE column from an object in the dataframe to a Date format
    df_dansville.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a row with the file name
    df_dansville.insert(1,'LOCATION','Dansville, NY') #Insert a row with broad Location name
    
    #Verify the data fields in each file
    print('File Name:',csv_alfred)
    print('Location:',df_alfred['LOCATION'].iloc[0],'\n')
    df_alfred.info()
    print('\n')
    
    #Display the dataframe as a table
    from IPython.display import display
    pd.set_option('display.max_columns', None) #Display all of the columns
    print('Location:',df_alfred['LOCATION'].iloc[0]) #Show the file location above the table to verify our file
    with pd.option_context("display.max_rows", 15): display(df_alfred) #Display the data
    
    File Name: Data/NOAA Daily Summaries - Alfred, NY.csv
    Location: Alfred, NY 
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 43858 entries, 0 to 43857
    Data columns (total 11 columns):
     #   Column     Non-Null Count  Dtype         
    ---  ------     --------------  -----         
     0   FILE_NAME  43858 non-null  object        
     1   LOCATION   43858 non-null  object        
     2   STATION    43858 non-null  object        
     3   NAME       43858 non-null  object        
     4   DATE       43858 non-null  datetime64[ns]
     5   PRCP       42059 non-null  float64       
     6   SNOW       38702 non-null  float64       
     7   SNWD       33925 non-null  float64       
     8   TMAX       43496 non-null  float64       
     9   TMIN       43532 non-null  float64       
     10  TOBS       39470 non-null  float64       
    dtypes: datetime64[ns](1), float64(6), object(4)
    memory usage: 3.7+ MB
    
    
    Location: Alfred, NY
    
    FILE_NAME LOCATION STATION NAME DATE PRCP SNOW SNWD TMAX TMIN TOBS
    0 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-01 0.00 0.0 NaN 39.0 24.0 NaN
    1 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-02 1.65 0.0 NaN 34.0 20.0 NaN
    2 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-03 0.30 3.0 NaN 20.0 4.0 NaN
    3 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-04 0.20 2.0 NaN 10.0 3.0 NaN
    4 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-05 0.30 3.0 NaN 15.0 0.0 NaN
    ... ... ... ... ... ... ... ... ... ... ... ...
    43853 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-08-20 0.08 0.0 0.0 64.0 48.0 51.0
    43854 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-08-21 0.00 0.0 0.0 61.0 47.0 48.0
    43855 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-08-22 0.02 0.0 0.0 64.0 47.0 51.0
    43856 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-08-23 0.00 0.0 0.0 73.0 51.0 52.0
    43857 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-08-24 0.00 0.0 NaN 78.0 52.0 52.0

    43858 rows × 11 columns

    There, that's better

    Now we can see and understand things a little better, they're much more human-readable and I am, after all, a human.

    From the documentation PDF from NOAA we can see that both the Precipitation (PRCP) and Snowfall (SNOW) columns are in inches.

    Let's tackle our first question: How many days during a winter month had temperatures above freezing?

    Winter in the northern hemisphere is classified as starting on the winter solstice (year's shortest day) and ending on the vernal equinox (day and night equal in length). That is generally December 21st or 22nd to March 20th or 21st. While I, myself, know that we've had large storms in November, let's just explore November later and stick to the entire month of December through the entire month of March for simplicity.

    We'll start with some more data validation.

    In [144]:
    pd.set_option('display.max_rows',25) #show all rows for this section
    df_alfred_year_rows = df_alfred.groupby(df_alfred.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
    df_alfred_year_rows_rslt = df_alfred_year_rows.loc[df_alfred_year_rows['Rows'] < 365] #Create a new frame with only the years that have less than 365 rows
    
    print(len(df_alfred_year_rows_rslt.index), 'years are missing some data.')
    display(df_alfred_year_rows_rslt)
    
    24 years are missing some data.
    
    DATE Rows
    4 1897 304
    7 1900 356
    8 1901 243
    9 1910 245
    11 1912 321
    26 1927 361
    38 1939 364
    44 1945 363
    45 1946 361
    46 1947 364
    49 1950 364
    53 1954 364
    63 1964 335
    88 1989 303
    89 1990 337
    90 1991 334
    92 1993 335
    93 1994 303
    106 2007 243
    109 2010 120
    110 2011 53
    111 2012 362
    113 2014 364
    123 2024 236

    Hmmmmmmmmm

    Well, since we're only interested in data from winter, how about we look just at those. With the months of December, January, February, and March, we should have a total of 122 days, accounting for a leap year.

    In [146]:
    df_alfred_winter = df_alfred.loc[df_alfred['DATE'].dt.month.isin([12, 1, 2, 3])]  #Create a new frame with only months in winter
    df_alfred_winter_rows = df_alfred_winter.groupby(df_alfred_winter.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year 
    df_alfred_winter_rows_rslt = df_alfred_winter_rows.loc[df_alfred_winter_rows['Rows'] < 121] #Create a new frame with only the years that have less than 121 rows
    
    print(len(df_alfred_winter_rows_rslt.index), 'years are missing some winter data.')
    display(df_alfred_winter_rows_rslt)
    
    11 years are missing some winter data.
    
    DATE Rows
    4 1897 90
    9 1910 31
    11 1912 85
    44 1945 119
    45 1946 119
    88 1989 90
    89 1990 93
    90 1991 90
    109 2010 90
    110 2012 118
    122 2024 91

    But wait, there's something wrong!

    We have to massage the data a little bit. If we are trying to look at winter seasons that span the change of a calendar year, we aren't comparing apples to apples because December 2020 would stay in 2020 while January 2021 would be in 2021, when we actually want them combined.

    In [55]:
    import numpy as np
    #Insert a new column
    df_alfred_winter.insert(5,'WINTER_YEAR',df_alfred_winter['DATE'].dt.year)
    #del df_alfred_winter['WINTER_YEAR']
    
    #Replace the year of the date field in the new column with 1 year higher if in October, November, or December
    df_alfred_winter['WINTER_YEAR'] = np.where(df_alfred_winter['DATE'].dt.month.isin([10, 11, 12]), df_alfred_winter['WINTER_YEAR']+1, df_alfred_winter['WINTER_YEAR'])
    #df_alfred_winter['WINTER_YEAR'] = df_alfred_winter.loc[df_alfred_winter['DATE'].dt.month.isin([10, 11, 12]), df_alfred_winter['WINTER_YEAR']] = df_alfred_winter['WINTER_YEAR']+1
    with pd.option_context("display.max_rows", 15): display(df_alfred_winter)
    
    FILE_NAME LOCATION STATION NAME DATE WINTER_YEAR PRCP SNOW SNWD TMAX TMIN TOBS
    0 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-01 1893 0.00 0.0 NaN 39.0 24.0 NaN
    1 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-02 1893 1.65 0.0 NaN 34.0 20.0 NaN
    2 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-03 1893 0.30 3.0 NaN 20.0 4.0 NaN
    3 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-04 1893 0.20 2.0 NaN 10.0 3.0 NaN
    4 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 1893-01-05 1893 0.30 3.0 NaN 15.0 0.0 NaN
    ... ... ... ... ... ... ... ... ... ... ... ... ...
    43708 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-03-27 2024 0.00 0.0 0.0 56.0 39.0 45.0
    43709 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-03-28 2024 0.00 0.0 0.0 64.0 29.0 33.0
    43710 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-03-29 2024 0.00 0.0 0.0 49.0 26.0 29.0
    43711 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-03-30 2024 0.00 0.0 0.0 37.0 24.0 28.0
    43712 NOAA Daily Summaries - Alfred, NY.csv Alfred, NY USC00300085 ALFRED, NY US 2024-03-31 2024 0.18 0.0 0.0 47.0 28.0 34.0

    14594 rows × 12 columns

    That seems like we have most of the data from winter that we need

    Our data may come out with some error, but with only 11 years of data missing some dates, I think we will be okay to use this data as-is.

    Let's start gathering some data!

    In [58]:
    import plotly.express as px
    import numpy as np
    df_alfred_winter_maxoverfreeze = df_alfred_winter[df_alfred_winter['TMAX'] > 32]
    #display(df_alfred_winter_maxoverfreeze)
    #df_alfred_winter_maxoverfreeze_agg = df_alfred_winter_maxoverfreeze.groupby(df_alfred_winter_maxoverfreeze.DATE.dt.year)['DATE'].count().reset_index(name='Days_Above_Frezing')
    df_alfred_winter_maxoverfreeze_agg = df_alfred_winter_maxoverfreeze.groupby(df_alfred_winter_maxoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
    #display(df_alfred_winter_maxoverfreeze_agg)
    
    #add a new rolling/moving average column
    df_alfred_winter_maxoverfreeze_agg['5YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
    df_alfred_winter_maxoverfreeze_agg['10YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
    df_alfred_winter_maxoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    
    #df_alfred_winter_maxoverfreeze_agg.info()
    #display(df_alfred_winter_maxoverfreeze_agg)
    fig_alfred_winter_maxoverfreeze = px.line(df_alfred_winter_maxoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days With High Temp. Above Freezing in Alfred, NY')
    fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_alfred_winter_maxoverfreeze.show()
    

    Oh, that data between 1900 and 1910 is being cranky

    Let's get rid of that and see what it looks like

    In [60]:
    df_alfred_winter_maxoverfreeze = df_alfred_winter_maxoverfreeze.loc[df_alfred_winter['WINTER_YEAR'] > 1909]
    df_alfred_winter_maxoverfreeze_agg = df_alfred_winter_maxoverfreeze.groupby(df_alfred_winter_maxoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
    #display(df_alfred_winter_maxoverfreeze_agg)
    #add a new rolling/moving average column
    df_alfred_winter_maxoverfreeze_agg['5YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
    df_alfred_winter_maxoverfreeze_agg['10YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
    df_alfred_winter_maxoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    
    #df_alfred_winter_maxoverfreeze_agg.info()
    #display(df_alfred_winter_maxoverfreeze_agg)
    fig_alfred_winter_maxoverfreeze = px.line(df_alfred_winter_maxoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days with High Temp. Above Freezing in Alfred, NY')
    fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_alfred_winter_maxoverfreeze.show()
    

    There might be something here

    It seems like the low points of the troughs in our rolling averages has increased in the most recent dip since the turn of the new millenium, which would suggest that more days over multiple years have had highs above freezing than previous swings

    It also seems like the previous years would average between 65 and 75 days a year, but starting in the 1990s that average increased to be between 70 and 80 days.

    Let's try looking at the Daily Low temperature, TMIN. Maybe this makes sense as the highs could be above freezing during the day but also if the low is above freezing that could certainly make a difference.

    In [62]:
    df_alfred_winter_minoverfreeze = df_alfred_winter[df_alfred_winter['TMIN'] > 32]
    
    df_alfred_winter_minoverfreeze = df_alfred_winter_minoverfreeze.loc[df_alfred_winter['WINTER_YEAR'] > 1909]
    df_alfred_winter_minoverfreeze_agg = df_alfred_winter_minoverfreeze.groupby(df_alfred_winter_minoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
    #display(df_alfred_winter_maxoverfreeze_agg)
    #add a new rolling/moving average column
    df_alfred_winter_minoverfreeze_agg['5YRolling'] = df_alfred_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
    df_alfred_winter_minoverfreeze_agg['10YRolling'] = df_alfred_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
    df_alfred_winter_minoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    
    #df_alfred_winter_maxoverfreeze_agg.info()
    #display(df_alfred_winter_maxoverfreeze_agg)
    fig_alfred_winter_minoverfreeze = px.line(df_alfred_winter_minoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days With Low Temp. Above Freezing in Alfred, NY')
    fig_alfred_winter_minoverfreeze.add_scatter(x=df_alfred_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_minoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_alfred_winter_minoverfreeze.add_scatter(x=df_alfred_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_minoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_alfred_winter_minoverfreeze.show()
    

    Now here's something!

    If my hypothesis is correct that the low temperature might have more of an impact on winter weather, in keeping snow on the ground and preventing rain, this seems to point to confirming what I have been experiencing and suspecting. If the low temperature on a day is

    While we don't have good data before 1910, it does look like there are still swings, peaks and valleys, but that pattern ended with a trough in the early 1960s, and had never seen another trough again until 2022 where the number of days with a low above freezing doubled compared to the previous trough!

    So let's look at the number of days we had with snow during those winter months

    In [65]:
    df_alfred_winter_snow = df_alfred_winter[df_alfred_winter['SNOW'] > 0]
    df_alfred_winter_snow = df_alfred_winter_snow.loc[df_alfred_winter_snow['WINTER_YEAR'] > 1909]
    df_alfred_winter_snow_agg = df_alfred_winter_snow.groupby(df_alfred_winter_snow.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_With_Snow')
    
    
    df_alfred_winter_snow_agg['5YRolling'] = df_alfred_winter_snow_agg['Days_With_Snow'].rolling(5).mean().fillna(0).astype(np.int64)
    df_alfred_winter_snow_agg['10YRolling'] = df_alfred_winter_snow_agg['Days_With_Snow'].rolling(10).mean().fillna(0).astype(np.int64)
    df_alfred_winter_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    #display(df_alfred_winter_snow_agg)
    with pd.option_context("display.max_rows", 15): display(df_alfred_winter_snow_agg)
    
    WINTER_YEAR Days_With_Snow 5YRolling 10YRolling
    0 1911 35 NaN NaN
    1 1912 10 NaN NaN
    2 1913 22 NaN NaN
    3 1914 44 NaN NaN
    4 1915 41 30.0 NaN
    ... ... ... ... ...
    108 2020 38 42.0 43.0
    109 2021 41 44.0 42.0
    110 2022 35 41.0 43.0
    111 2023 34 38.0 41.0
    112 2024 25 34.0 39.0

    113 rows × 4 columns

    In [66]:
    fig_alfred_winter_snow = px.line(df_alfred_winter_snow_agg, x='WINTER_YEAR', y='Days_With_Snow', title='Number of Days With Snow in Alfred, NY')
    fig_alfred_winter_snow.add_scatter(x=df_alfred_winter_snow_agg['WINTER_YEAR'], y=df_alfred_winter_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_alfred_winter_snow.add_scatter(x=df_alfred_winter_snow_agg['WINTER_YEAR'], y=df_alfred_winter_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_alfred_winter_snow.show()
    

    This confirms my personal experience, but doesn't point to a greater long-term issue

    Since the winter of 2010, the first year I was in Alfred, I had noticed that the winters had not been as harsh as before. This confirms this, but does not confirm that these years had been any different than in previous timeframes.

    Let's look at the total accumulated snowfall each year, then.

    In [68]:
    #Take the same dataframe as before where we found all days with SNOW > 0, and sum instead of count
    df_alfred_winter_total_snow_agg = df_alfred_winter_snow.groupby(df_alfred_winter_snow.WINTER_YEAR)['SNOW'].sum().reset_index(name='Total_Snowfall')
    df_alfred_winter_total_snow_agg['5YRolling'] = df_alfred_winter_total_snow_agg['Total_Snowfall'].rolling(5).mean().fillna(0).astype(np.int64)
    df_alfred_winter_total_snow_agg['10YRolling'] = df_alfred_winter_total_snow_agg['Total_Snowfall'].rolling(10).mean().fillna(0).astype(np.int64)
    df_alfred_winter_total_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    
    with pd.option_context("display.max_rows", 15): display(df_alfred_winter_total_snow_agg)
    
    WINTER_YEAR Total_Snowfall 5YRolling 10YRolling
    0 1911 68.9 NaN NaN
    1 1912 17.8 NaN NaN
    2 1913 28.7 NaN NaN
    3 1914 74.3 NaN NaN
    4 1915 71.3 52.0 NaN
    ... ... ... ... ...
    108 2020 69.2 57.0 55.0
    109 2021 41.8 60.0 53.0
    110 2022 55.9 56.0 56.0
    111 2023 42.9 49.0 54.0
    112 2024 27.9 47.0 50.0

    113 rows × 4 columns

    In [69]:
    fig_alfred_winter_total_snow = px.line(df_alfred_winter_total_snow_agg, x='WINTER_YEAR', y='Total_Snowfall', title='Total Annual Snowfall in Alfred, NY')
    fig_alfred_winter_total_snow.add_scatter(x=df_alfred_winter_total_snow_agg['WINTER_YEAR'], y=df_alfred_winter_total_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_alfred_winter_total_snow.add_scatter(x=df_alfred_winter_total_snow_agg['WINTER_YEAR'], y=df_alfred_winter_total_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_alfred_winter_total_snow.show()
    

    Hmmmm, that's interesting too

    It's easy to tell that total snowfall has decreased since a high rolling average in 2009 of ~90 inches to only 50 inches in 2024, but that is now the same average as the 1920s to the 1930s. Interestingly, the total snowfall was steadily increasing over the course of the 20th century, until a drop in the 80s followed by a steep increase until the early 2000s. We will not be able to tell until such time comes but we have not hit a low point yet in the most recent downtrend and it might continue downward still.

    I actually remember that large spike in 1993, "The Blizzard of '93" as so many people remember across the east coast. I was only 2 years old or less, but I remember being carried by my father after the truck got stuck in our neighbor's long driveway, checking in on them after the power went out and collecting with my mother and brother. In a funny way we can use this anecdote as evidence that this data is accurate.

    Let's take one last look at Average Snow Pack

    Before we move on to looking at Dansville to see if our various microclimates in NYS have experienced climate change any differently from each other, let's see if there's been any change to average snow pack between December and March.

    In [72]:
    df_alfred_winter_snowpack = df_alfred_winter
    df_alfred_winter_snowpack = df_alfred_winter_snowpack.loc[df_alfred_winter_snowpack['WINTER_YEAR'] > 1909]
    df_alfred_winter_snowpack_agg = df_alfred_winter_snow.groupby(df_alfred_winter_snow.WINTER_YEAR)['SNWD'].mean().reset_index(name='Average_Snowpack')
    #df_alfred_winter_snowpack_agg['5YRolling'] = df_alfred_winter_total_snow_agg['Average_Snowpack'].rolling(5).mean().fillna(0).astype(np.int64)
    #df_alfred_winter_snowpack_agg['10YRolling'] = df_alfred_winter_total_snow_agg['Average_Snowpack'].rolling(10).mean().fillna(0).astype(np.int64)
    #df_alfred_winter_snowpack_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    with pd.option_context("display.max_rows", 15): display(df_alfred_winter_snowpack_agg)
    
    WINTER_YEAR Average_Snowpack
    0 1911 NaN
    1 1912 NaN
    2 1913 NaN
    3 1914 10.204545
    4 1915 10.975610
    ... ... ...
    108 2020 3.552632
    109 2021 4.439024
    110 2022 4.257143
    111 2023 2.117647
    112 2024 1.600000

    113 rows × 2 columns

    We have some missing data, but let's see how it looks

    In [74]:
    fig_alfred_winter_snowpack = px.line(df_alfred_winter_snowpack_agg, x='WINTER_YEAR', y='Average_Snowpack', title='Average Annual Snowpack in Alfred, NY')
    #fig_alfred_winter_snowpack.add_scatter(x=df_alfred_winter_snowpack_agg['WINTER_YEAR'], y=df_alfred_winter_snowpack_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    #fig_alfred_winter_snowpack.add_scatter(x=df_alfred_winter_snowpack_agg['WINTER_YEAR'], y=df_alfred_winter_snowpack_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_alfred_winter_snowpack.show()
    

    Not much to see here

    Unfortunately there is a good chunk of data missing between 1993 and 2010, which also impact the use of moving averages so there isn't much to look at here. At the least we can see lower figures since 2014 on average compared to the previous years, but it's a bit unscientific to draw a conclusion.

    Let's look at Dansville's weather data

    In [77]:
    pd.set_option('display.max_rows',None) #show all rows for this section
    df_dansville_winter = df_dansville.loc[df_dansville['DATE'].dt.month.isin([12, 1, 2, 3])]  #Create a new frame with only months in winter
    df_dansville_winter.insert(5,'WINTER_YEAR',df_dansville_winter['DATE'].dt.year)
    #del df_dansville_winter['WINTER_YEAR']
    
    #Replace the year of the date field in the new column with 1 year higher if in October, November, or December
    #df_alfred_winter.loc['DATE'].dt.month.isin[10, 11, 12])
    df_dansville_winter['WINTER_YEAR'] = np.where(df_dansville_winter['DATE'].dt.month.isin([10, 11, 12]), df_dansville_winter['WINTER_YEAR']+1, df_dansville_winter['WINTER_YEAR'])
    with pd.option_context("display.max_rows", 15): display(df_dansville_winter)
    
    FILE_NAME LOCATION STATION NAME DATE WINTER_YEAR PRCP SNOW SNWD TMAX TMIN TOBS
    153 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-01 1918 0.00 0.0 NaN 32.0 8.0 26.0
    154 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-02 1918 0.10 1.0 NaN 26.0 15.0 16.0
    155 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-03 1918 0.03 0.3 NaN 34.0 10.0 28.0
    156 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-04 1918 0.00 0.0 NaN 50.0 28.0 44.0
    157 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-05 1918 0.00 0.0 NaN 44.0 15.0 15.0
    ... ... ... ... ... ... ... ... ... ... ... ... ...
    37609 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-27 2024 0.00 0.0 0.0 59.0 43.0 52.0
    37610 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-28 2024 0.00 0.0 0.0 67.0 36.0 37.0
    37611 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-29 2024 0.00 0.0 0.0 50.0 25.0 27.0
    37612 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-30 2024 0.00 0.0 0.0 40.0 27.0 27.0
    37613 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-31 2024 0.01 0.0 0.0 52.0 27.0 38.0

    12444 rows × 12 columns

    In [148]:
    df_dansville_year_rows = df_dansville.groupby(df_dansville.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
    df_dansville_year_rows_rslt = df_dansville_year_rows.loc[df_dansville_year_rows['Rows'] < 365] #Create a new frame with only the years that have less than 365 rows
    
    print(len(df_dansville_year_rows_rslt.index), 'years are missing some data:')
    display(df_dansville_year_rows_rslt)
    
    12 years are missing some data:
    
    DATE Rows
    0 1917 153
    1 1918 282
    17 1934 296
    22 1939 334
    23 1941 1
    24 1942 306
    39 1957 364
    69 1987 182
    77 1995 331
    89 2007 243
    92 2010 334
    106 2024 237
    In [150]:
    df_dansville_winter_rows = df_dansville_winter.groupby(df_dansville_winter.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year 
    df_dansville_winter_rows_rslt = df_dansville_winter_rows.loc[df_dansville_winter_rows['Rows'] < 121] #Create a new frame with only the years that have less than 121 rows
    
    print(len(df_dansville_winter_rows_rslt.index), 'years are missing some winter data:')
    display(df_dansville_winter_rows_rslt)
    
    7 years are missing some winter data:
    
    DATE Rows
    0 1918 59
    16 1934 52
    21 1939 90
    22 1942 62
    75 1995 120
    87 2007 90
    104 2024 91

    We're off to a better start looking at Dansville

    Luckily less of the data is missing here at the Dansville station. I still want to look at Dansville first because Bath's data has multiple different stations that changed locations and have overlapping entries which makes it much more troublesome to deal with.

    In [81]:
    df_dansville_winter_maxoverfreeze = df_dansville_winter[df_dansville_winter['TMAX'] > 32]
    with pd.option_context("display.max_rows", 15): display(df_dansville_winter_maxoverfreeze)
    #df_dansville_winter_maxoverfreeze_agg = df_dansville_winter_maxoverfreeze.groupby(df_dansville_winter_maxoverfreeze.DATE.dt.year)['DATE'].count().reset_index(name='Days_Above_Frezing')
    df_dansville_winter_maxoverfreeze_agg = df_dansville_winter_maxoverfreeze.groupby(df_dansville_winter_maxoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
    #display(df_dansville_winter_maxoverfreeze_agg)
    
    #add a new rolling/moving average column
    df_dansville_winter_maxoverfreeze_agg['5YRolling'] = df_dansville_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
    df_dansville_winter_maxoverfreeze_agg['10YRolling'] = df_dansville_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
    df_dansville_winter_maxoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    
    #df_alfred_winter_maxoverfreeze_agg.info()
    #display(df_alfred_winter_maxoverfreeze_agg)
    fig_dansville_winter_maxoverfreeze = px.line(df_dansville_winter_maxoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days With High Temp. Above Freezing in Dansville, NY')
    fig_dansville_winter_maxoverfreeze.add_scatter(x=df_dansville_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_maxoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_dansville_winter_maxoverfreeze.add_scatter(x=df_dansville_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_maxoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_dansville_winter_maxoverfreeze.show()
    
    FILE_NAME LOCATION STATION NAME DATE WINTER_YEAR PRCP SNOW SNWD TMAX TMIN TOBS
    155 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-03 1918 0.03 0.3 NaN 34.0 10.0 28.0
    156 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-04 1918 0.00 0.0 NaN 50.0 28.0 44.0
    157 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-05 1918 0.00 0.0 NaN 44.0 15.0 15.0
    159 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-07 1918 0.00 0.0 NaN 38.0 16.0 35.0
    160 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 1918-02-08 1918 0.50 0.0 NaN 58.0 34.0 48.0
    ... ... ... ... ... ... ... ... ... ... ... ... ...
    37609 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-27 2024 0.00 0.0 0.0 59.0 43.0 52.0
    37610 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-28 2024 0.00 0.0 0.0 67.0 36.0 37.0
    37611 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-29 2024 0.00 0.0 0.0 50.0 25.0 27.0
    37612 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-30 2024 0.00 0.0 0.0 40.0 27.0 27.0
    37613 NOAA Daily Summaries - Alfred, NY.csv Dansville, NY USC00301974 DANSVILLE, NY US 2024-03-31 2024 0.01 0.0 0.0 52.0 27.0 38.0

    8311 rows × 12 columns

    In [82]:
    df_dansville_winter_minoverfreeze = df_dansville_winter[df_dansville_winter['TMIN'] > 32]
    
    df_dansville_winter_minoverfreeze = df_dansville_winter_minoverfreeze.loc[df_dansville_winter['WINTER_YEAR'] > 1909]
    df_dansville_winter_minoverfreeze_agg = df_dansville_winter_minoverfreeze.groupby(df_dansville_winter_minoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
    #display(df_dansville_winter_maxoverfreeze_agg)
    #add a new rolling/moving average column
    df_dansville_winter_minoverfreeze_agg['5YRolling'] = df_dansville_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
    df_dansville_winter_minoverfreeze_agg['10YRolling'] = df_dansville_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
    df_dansville_winter_minoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    
    #df_dansville_winter_maxoverfreeze_agg.info()
    #display(df_dansville_winter_maxoverfreeze_agg)
    fig_dansville_winter_minoverfreeze = px.line(df_dansville_winter_minoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days With Low Temp. Above Freezing in Dansville, NY',)#.update_traces(visible="legendonly", selector=lambda t: not t.name in ["Days_Above_Frezing"])
    fig_dansville_winter_minoverfreeze.add_scatter(x=df_dansville_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_minoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_dansville_winter_minoverfreeze.add_scatter(x=df_dansville_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_minoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    
    fig_dansville_winter_minoverfreeze.show()
    

    This tells a very different story

    Dansville does not seem to have seen an increase in days at high temperatures during winter. It's amazing how much of a difference a 30 minute drive can make in New York in terms of weather.

    Let's see if there's any difference in snowfall.

    In [84]:
    df_dansville_winter_snow = df_dansville_winter[df_dansville_winter['SNOW'] > 0]
    df_dansville_winter_snow = df_dansville_winter_snow.loc[df_dansville_winter_snow['WINTER_YEAR'] > 1909]
    df_dansville_winter_snow_agg = df_dansville_winter_snow.groupby(df_dansville_winter_snow.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_With_Snow')
    
    
    df_dansville_winter_snow_agg['5YRolling'] = df_dansville_winter_snow_agg['Days_With_Snow'].rolling(5).mean().fillna(0).astype(np.int64)
    df_dansville_winter_snow_agg['10YRolling'] = df_dansville_winter_snow_agg['Days_With_Snow'].rolling(10).mean().fillna(0).astype(np.int64)
    df_dansville_winter_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    with pd.option_context("display.max_rows", 15): display(df_dansville_winter_snow_agg)
    
    WINTER_YEAR Days_With_Snow 5YRolling 10YRolling
    0 1918 9 NaN NaN
    1 1919 8 NaN NaN
    2 1920 13 NaN NaN
    3 1921 11 NaN NaN
    4 1922 9 10.0 NaN
    ... ... ... ... ...
    96 2020 13 15.0 16.0
    97 2021 12 15.0 14.0
    98 2022 14 15.0 15.0
    99 2023 10 13.0 15.0
    100 2024 12 12.0 14.0

    101 rows × 4 columns

    In [85]:
    fig_dansville_winter_snow = px.line(df_dansville_winter_snow_agg, x='WINTER_YEAR', y='Days_With_Snow', title='Number of Days With Snow in Dansville, NY')
    fig_dansville_winter_snow.add_scatter(x=df_dansville_winter_snow_agg['WINTER_YEAR'], y=df_dansville_winter_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_dansville_winter_snow.add_scatter(x=df_dansville_winter_snow_agg['WINTER_YEAR'], y=df_dansville_winter_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_dansville_winter_snow.show()
    
    In [86]:
    #Take the same dataframe as before where we found all days with SNOW > 0, and sum instead of count
    df_dansville_winter_total_snow_agg = df_dansville_winter_snow.groupby(df_dansville_winter_snow.WINTER_YEAR)['SNOW'].sum().reset_index(name='Total_Snowfall')
    df_dansville_winter_total_snow_agg['5YRolling'] = df_dansville_winter_total_snow_agg['Total_Snowfall'].rolling(5).mean().fillna(0).astype(np.int64)
    df_dansville_winter_total_snow_agg['10YRolling'] = df_dansville_winter_total_snow_agg['Total_Snowfall'].rolling(10).mean().fillna(0).astype(np.int64)
    df_dansville_winter_total_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    with pd.option_context("display.max_rows", 15): display(df_dansville_winter_total_snow_agg)
    
    WINTER_YEAR Total_Snowfall 5YRolling 10YRolling
    0 1918 6.5 NaN NaN
    1 1919 23.3 NaN NaN
    2 1920 23.7 NaN NaN
    3 1921 15.8 NaN NaN
    4 1922 17.1 17.0 NaN
    ... ... ... ... ...
    96 2020 23.8 29.0 31.0
    97 2021 29.4 32.0 29.0
    98 2022 31.7 33.0 31.0
    99 2023 15.1 26.0 30.0
    100 2024 13.0 22.0 27.0

    101 rows × 4 columns

    In [87]:
    fig_dansville_winter_total_snow = px.line(df_dansville_winter_total_snow_agg, x='WINTER_YEAR', y='Total_Snowfall', title='Total Annual Snowfall in Dansville, NY')
    fig_dansville_winter_total_snow.add_scatter(x=df_dansville_winter_total_snow_agg['WINTER_YEAR'], y=df_dansville_winter_total_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    fig_dansville_winter_total_snow.add_scatter(x=df_dansville_winter_total_snow_agg['WINTER_YEAR'], y=df_dansville_winter_total_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_dansville_winter_total_snow.show()
    

    Is 100 years of data enough?

    We tend to generally understand that Earth's climate changes in cycles - this would be a very long cycle. But we might be able to discern that we are on the downward side, the ending side of another cycle, which does not bode well for the next decade.

    From the 1920s through the 1940s it looks like there was a steady average of 20-25 inches of snow throughout winter in Dansville, increasing in the 50s through to 1980, and then starting a downward trend. What's noticable is the determined drop in snowfall into the Nineties below the previous long-range average in the early 20th century. While snowfall trended upwards through the 2000s until 2020, it looks like Dansville is trending downward again and the peak rolling average was only about 60% of the previous cycle's rolling peak.

    I am concerned by this. If we are truly on a downward trend we might continue seeing new low averages, in terms of annual snowfall, lower high averages, and lower annual snowfall in general. If the cycles last this long then I might not be confident in making a definitive statement, but I am still concerned at what this suggests right now.

    Hold on now, what about 1993?

    Where is the large spike in 1993? And, even 1931 showed up in Alfred's data and yet isn't present here in Dansville, another year with a famous blizzard. Dansville was also affected strongly by the Blizzard of '93, this makes me think that the data from Dansville's weather station is not accurate.

    One more time, with feeling - what about snowpack?

    In [90]:
    df_dansville_winter_snowpack = df_dansville_winter
    df_dansville_winter_snowpack = df_dansville_winter_snowpack.loc[df_dansville_winter_snowpack['WINTER_YEAR'] > 1909]
    df_dansville_winter_snowpack_agg = df_dansville_winter_snow.groupby(df_dansville_winter_snow.WINTER_YEAR)['SNWD'].mean().reset_index(name='Average_Snowpack')
    #df_dansville_winter_snowpack_agg['5YRolling'] = df_dansville_winter_total_snow_agg['Average_Snowpack'].rolling(5).mean().fillna(0).astype(np.int64)
    #df_dansville_winter_snowpack_agg['10YRolling'] = df_dansville_winter_total_snow_agg['Average_Snowpack'].rolling(10).mean().fillna(0).astype(np.int64)
    #df_dansville_winter_snowpack_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
    with pd.option_context("display.max_rows", 15): display(df_dansville_winter_snowpack_agg)
    
    WINTER_YEAR Average_Snowpack
    0 1918 NaN
    1 1919 2.666667
    2 1920 4.750000
    3 1921 2.666667
    4 1922 2.166667
    ... ... ...
    96 2020 2.923077
    97 2021 3.666667
    98 2022 4.714286
    99 2023 1.500000
    100 2024 1.250000

    101 rows × 2 columns

    In [91]:
    fig_dansville_winter_snowpack = px.line(df_dansville_winter_snowpack_agg, x='WINTER_YEAR', y='Average_Snowpack', title='Average Annual Snowpack in Dansville, NY')
    #fig_dansville_winter_snowpack.add_scatter(x=df_dansville_winter_snowpack_agg['WINTER_YEAR'], y=df_dansville_winter_snowpack_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
    #fig_dansville_winter_snowpack.add_scatter(x=df_dansville_winter_snowpack_agg['WINTER_YEAR'], y=df_dansville_winter_snowpack_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
    fig_dansville_winter_snowpack.show()
    

    It still seems that our snow depth data is a little lacking.

    With the note about the snowfall data from Dansville, I don't know if we can use Dansville to draw any conclusions.

    Let's start looking at Bath, then.

    As a reminder, here is what data we have available. As mentioned before, from NOAA's notes, Coverage is an approximation of total completeness based on the most complete data element, and the overall data range. But also, consistency is just as important as accuracy with data so let's see what data we can cull so we have a single row of data for each day in similar locations over time.

    Bath, NY

    The Bath, NY data is comprised of data in 6 stations in different locations since 1953.

  • BATH, NY US - GHCND:USC00300448 - 1953/08/01 to 2014/09/30 - 91% Coverage
  • AVOCA 5.0 SSW, NY US GHCND:US1NYST0023 - 2008/01/11 to 2011/09/29 - 91% Coverage
  • BATH 4.2 E, NY US GHCND:US1NYST0033 - 2010/05/14 to 2024/08/25 - 47% Coverage
  • BATH 0.4 N, NY US GHCND:US1NYST0029 - 2010/09/01 to 2012/07/12 - 36% Coverage
  • BATH 6.8 SW, NY US GHCND:US1NYST0035 - 2016/04/02 to 2022/03/31 - 36% Coverage
  • BATH 1.3 E, NY US GHCND:US1NYST0044 - 2020/05/01 to 2024/08/26 - 33% Coverage
  • The overall "Bath, NY" station covers from 1953/08/01 until 2014/09/30, so we'll start there. The Avoca station covers dates that are all already reported by the "Bath, NY" station so we'll just get rid of all of the Avoca data, same with Bath 0.4. Bath 4.2 then covers up to 2024/08/25, but has some earlier data that overlaps the original "Bath, NY" station, so let's cull Bath 4.2 E earlier than 2014/09/30. The other stations, Bath all exist in shorter timeframes that are already covered by the other stations we are using, so let's get rid of them outright.

    In [94]:
    #Filtering the Bath, NY Dataframe
    #df_alfred_winter = df_alfred.loc[df_alfred['DATE'].dt.month.isin([12, 1, 2, 3])]
    print('The raw Bath, NY dataset contains:')
    df_bath.info()
    
    print('\n')
    print('We can validate our removed rows by checking the following:')
    df_bath_stations = df_bath.groupby(df_bath.NAME)['NAME'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
    display(df_bath_stations)
    
    print('\n')
    print('Here is how many rows we want to delete from Bath 4.2 that are earlier than 2014/09/30:')
    df_bath_42 = df_bath[(df_bath['NAME'].isin(['BATH 4.2 E, NY US']))]
    df_bath_42['DELETE_FLAG'] = ''
    df_bath_42['DELETE_FLAG'] = np.where(df_bath_42['DATE'] <= '2014-09-30', 1, 0)
    df_bath_42_count = df_bath_42.groupby(df_bath_42.DELETE_FLAG)['DELETE_FLAG'].count().reset_index(name='Rows')
    display(df_bath_42_count)
    
    df_bath_filtered = df_bath[~(
                             (df_bath['NAME'].isin(['AVOCA 5.0 SSW, NY US', 'BATH 0.4 N, NY US', 'BATH 6.8 SW, NY US', 'BATH 1.3 E, NY US']))
                               |
                             ((df_bath['NAME'].isin(['BATH 4.2 E, NY US'])) & (df_bath['DATE'] <= '2014-09-30'))
                            )
                          ]
    
    print('\n')
    print('The new dataset contains:')
    df_bath_filtered.info()
    
    The raw Bath, NY dataset contains:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 24741 entries, 0 to 24740
    Data columns (total 11 columns):
     #   Column     Non-Null Count  Dtype         
    ---  ------     --------------  -----         
     0   FILE_NAME  24741 non-null  object        
     1   LOCATION   24741 non-null  object        
     2   STATION    24741 non-null  object        
     3   NAME       24741 non-null  object        
     4   DATE       24741 non-null  datetime64[ns]
     5   PRCP       24412 non-null  float64       
     6   SNOW       23025 non-null  float64       
     7   SNWD       20704 non-null  float64       
     8   TMAX       17315 non-null  float64       
     9   TMIN       17363 non-null  float64       
     10  TOBS       17383 non-null  float64       
    dtypes: datetime64[ns](1), float64(6), object(4)
    memory usage: 2.1+ MB
    
    
    We can validate our removed rows by checking the following:
    
    NAME Rows
    0 AVOCA 5.0 SSW, NY US 193
    1 BATH 0.4 N, NY US 300
    2 BATH 1.3 E, NY US 628
    3 BATH 4.2 E, NY US 2488
    4 BATH 6.8 SW, NY US 811
    5 BATH, NY US 20321
    
    Here is how many rows we want to delete from Bath 4.2 that are earlier than 2014/09/30:
    
    DELETE_FLAG Rows
    0 0 1898
    1 1 590
    
    The new dataset contains:
    <class 'pandas.core.frame.DataFrame'>
    Index: 22219 entries, 1111 to 24112
    Data columns (total 11 columns):
     #   Column     Non-Null Count  Dtype         
    ---  ------     --------------  -----         
     0   FILE_NAME  22219 non-null  object        
     1   LOCATION   22219 non-null  object        
     2   STATION    22219 non-null  object        
     3   NAME       22219 non-null  object        
     4   DATE       22219 non-null  datetime64[ns]
     5   PRCP       22111 non-null  float64       
     6   SNOW       21813 non-null  float64       
     7   SNWD       20575 non-null  float64       
     8   TMAX       17315 non-null  float64       
     9   TMIN       17363 non-null  float64       
     10  TOBS       17383 non-null  float64       
    dtypes: datetime64[ns](1), float64(6), object(4)
    memory usage: 2.0+ MB
    

    The data for Bath is cleaned, let's get started!

    We counted the rows from each station so we know that we want to get rid of 1,932 rows from Avoca, Bath 0.4, Bath 6.8, and Bath 1.3. We also need to remove 590 rows from Bath 4.2 where its reporting date is earlier than 2014/09/30. That should leave 22,219 rows.

    Now let's reduce the Bath data to just the winter months again.

    In [96]:
    df_bath_winter = df_bath_filtered.loc[df_bath_filtered['DATE'].dt.month.isin([12, 1, 2, 3])]  #Create a new frame with only months in winter
    df_bath_winter.insert(5,'WINTER_YEAR',df_bath_winter['DATE'].dt.year)
    #del df_dansville_winter['WINTER_YEAR']
    
    #Replace the year of the date field in the new column with 1 year higher if in October, November, or December
    df_bath_winter['WINTER_YEAR'] = np.where(df_bath_winter['DATE'].dt.month.isin([10, 11, 12]), df_bath_winter['WINTER_YEAR']+1, df_bath_winter['WINTER_YEAR'])
    with pd.option_context("display.max_rows", 15): display(df_bath_winter)
    
    FILE_NAME LOCATION STATION NAME DATE WINTER_YEAR PRCP SNOW SNWD TMAX TMIN TOBS
    1233 NOAA Daily Summaries - Alfred, NY.csv Bath, NY USC00300448 BATH, NY US 1953-12-01 1954 0.00 0.0 0.0 NaN NaN NaN
    1234 NOAA Daily Summaries - Alfred, NY.csv Bath, NY USC00300448 BATH, NY US 1953-12-02 1954 0.00 0.0 0.0 NaN NaN NaN
    1235 NOAA Daily Summaries - Alfred, NY.csv Bath, NY USC00300448 BATH, NY US 1953-12-03 1954 0.00 0.0 0.0 NaN NaN NaN
    1236 NOAA Daily Summaries - Alfred, NY.csv Bath, NY USC00300448 BATH, NY US 1953-12-04 1954 0.00 0.0 0.0 NaN NaN NaN
    1237 NOAA Daily Summaries - Alfred, NY.csv Bath, NY USC00300448 BATH, NY US 1953-12-05 1954 0.23 0.0 0.0 NaN NaN NaN
    ... ... ... ... ... ... ... ... ... ... ... ... ...
    23998 NOAA Daily Summaries - Alfred, NY.csv Bath, NY US1NYST0033 BATH 4.2 E, NY US 2024-03-26 2024 0.00 0.0 0.0 NaN NaN NaN
    23999 NOAA Daily Summaries - Alfred, NY.csv Bath, NY US1NYST0033 BATH 4.2 E, NY US 2024-03-27 2024 0.00 0.0 0.0 NaN NaN NaN
    24000 NOAA Daily Summaries - Alfred, NY.csv Bath, NY US1NYST0033 BATH 4.2 E, NY US 2024-03-28 2024 0.00 0.0 0.0 NaN NaN NaN
    24001 NOAA Daily Summaries - Alfred, NY.csv Bath, NY US1NYST0033 BATH 4.2 E, NY US 2024-03-30 2024 0.00 0.0 0.0 NaN NaN NaN
    24002 NOAA Daily Summaries - Alfred, NY.csv Bath, NY US1NYST0033 BATH 4.2 E, NY US 2024-03-31 2024 0.00 0.0 0.0 NaN NaN NaN

    7097 rows × 12 columns

    In [152]:
    df_bath_year_rows = df_bath_filtered.groupby(df_bath_filtered.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
    df_bath_year_rows_rslt = df_bath_year_rows.loc[df_bath_year_rows['Rows'] < 365] #Create a new frame with only the years that have less than 365 rows
    
    print(len(df_bath_year_rows_rslt.index), 'years are missing some data:')
    display(df_bath_year_rows_rslt)
    
    31 years are missing some data:
    
    DATE Rows
    0 1953 153
    22 1975 90
    23 1976 263
    31 1984 335
    42 1995 363
    ... ... ...
    65 2020 173
    66 2021 75
    67 2022 199
    68 2023 333
    69 2024 185

    31 rows × 2 columns

    In [154]:
    df_bath_winter_rows = df_bath_winter.groupby(df_bath_winter.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year 
    df_bath_winter_rows_rslt = df_bath_winter_rows.loc[df_bath_winter_rows['Rows'] < 121] #Create a new frame with only the years that have less than 121 rows
    
    print(len(df_bath_winter_rows_rslt.index), 'years are missing some winter data:')
    display(df_bath_winter_rows_rslt)
    
    26 years are missing some winter data:
    
    DATE Rows
    0 1953 31
    22 1975 90
    23 1976 31
    42 1995 119
    44 1997 89
    ... ... ...
    64 2019 25
    65 2020 21
    66 2022 21
    67 2023 104
    68 2024 75

    26 rows × 2 columns

    This is a little disappointing

    Bath does not have robust data in the most recent years that I was hoping to study, so we will have to leave it there. Unfortunately, this is how life goes sometimes, we just won't have this data to use. Maybe in the future stations in Bath will begin recording data well again but until then we'll have to do with other stations.

    Conclusions

    We told the story along the way but, we'll summarize a bit here. I do not believe that there was enough evidence here to show large changes, but in the world of climate change it seems as if we are dealing with effects following only a change of a few degrees or less. Our ski seasons have become shorter, the snow is heavier and wetter and melts quickly, and I believe that even these threshold temperatures around the freezing point make for more dangerous driving conditions as dry snow seems easier to get mechanical grip in with car tires while wet slushy snow makes a more slick and unforgiving surface.

    I do, though, believe that the increase in days above freezing has been a strong contributing factor to the loss of the winter season here. It seems as if we are on the uptrend, in that regard, so I can only hope that we have hit a peak.